In this notebook, we will perform a causal analysis using observational data from the 1985 Current Population Survey (CPS).
The goal of this analysis is to estimate the causal effect of education on wages, taking into account potential confounding variables.
The dataset contains information on various demographic, economic, and labor-related variables.
CPS = read.csv("CPS1985.csv")
head(CPS, 10)
The dataset includes 12 variables that capture individual social and economic characteristics. These variables can be categorized as follows:
Numerical Variables:
wage: The wage of the individual.
education: The number of years of education.
experience: The number of years of work experience.
age: The age of the individual.
Categorical Variables:
ethnicity: The ethnicity of the individual.
region: The region where the individual lives.
gender: The gender of the individual.
occupation: The occupation of the individual.
sector: The sector of employment.
union: Union membership status.
married: Marital status.
print(colSums(is.na(CPS)))
## X wage education experience age ethnicity region
## 0 0 0 0 0 0 0
## gender occupation sector union married
## 0 0 0 0 0
There are no missing values in any variable, so we can proceed with the analysis.
Let’s see the distribution of quantitative variables.
numerical_vars <- c("wage", "education", "experience", "age")
summary(CPS[numerical_vars])
## wage education experience age
## Min. : 1.000 Min. : 2.00 Min. : 0.00 Min. :18.00
## 1st Qu.: 5.250 1st Qu.:12.00 1st Qu.: 8.00 1st Qu.:28.00
## Median : 7.780 Median :12.00 Median :15.00 Median :35.00
## Mean : 9.024 Mean :13.02 Mean :17.82 Mean :36.83
## 3rd Qu.:11.250 3rd Qu.:15.00 3rd Qu.:26.00 3rd Qu.:44.00
## Max. :44.500 Max. :18.00 Max. :55.00 Max. :64.00
print(paste("Sample size:", nrow(CPS)))
## [1] "Sample size: 534"
We can observe that the dataset includes 534 adults aged 18 to 64 years, with 2 to 18 years of education and 0 to 55 years of work experience, so the variability among the observed sample of individuals is quite large. The same applies to wages: they range from 1 to 44.5 dollars per hour.
library(ggplot2)
library(tidyverse)
## Warning: package 'forcats' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("ggiraph")
library(ggiraph)
## Warning: package 'ggiraph' was built under R version 4.4.1
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.1
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
CPS_long <- CPS %>%
pivot_longer(cols = c(wage, education, experience, age), names_to = "variable_name", values_to = "value")
quantiles <- CPS_long %>%
group_by(variable_name) %>%
summarise(
Q1 = quantile(value, 0.25),
Median = quantile(value, 0.50),
Q3 = quantile(value, 0.75)
)
density_plots <- ggplot(data = CPS_long, aes(x = value)) +
geom_density(fill = "steelblue", color = "black", alpha = 0.7) +
facet_wrap(~ variable_name, scales = "free") +
labs(title = "Distribution of Numerical Variables", x = NULL, y = "Density") +
theme_minimal() +
geom_vline(data = quantiles, aes(xintercept = Q1), color = "blue", linetype = "dashed", linewidth = 0.5) +
geom_vline(data = quantiles, aes(xintercept = Median), color = "red", linetype = "dashed", linewidth = 1) +
geom_vline(data = quantiles, aes(xintercept = Q3), color = "blue", linetype = "dashed", linewidth = 0.5)
ggplotly(density_plots, width = 800, height = 600)
As we can see from the plots, the distributions of wage, experience, and age are right-skewed, meaning that most individuals have lower values of these variables. The distribution of education is more evenly spread, with a median of 12 years.
Since the variables are skewed, there might be some outliers. Let’s have a look at the boxplots to see if there are any.
boxplots <- ggplot(CPS_long, aes(x = variable_name, y = value)) +
geom_boxplot(fill = "steelblue", color = "black", alpha = 0.7) +
facet_wrap(~ variable_name, scales = "free", ncol = 4) +
labs(title = "Distribution of Numerical Variables", x = NULL, y = "Value") +
theme_minimal()
ggplotly(boxplots, width = 800, height = 600)
There are visible outliers in the data:
Education: some individuals have fewer than 8 years of education, which probably means they completed only part of their primary schooling. The number of such observations is small, and we treat them as valid data points, since a small fraction of the population indeed might not finish school.
Experience: a couple of observations show more than 49 years of work experience, well above the boxplot's upper whisker. An older fraction of the population could in fact have started their careers early and accumulated long work experience, so we don't consider these observations problematic outliers, given that the maximum age in the sample is 64 years.
Wage: a few observations show wages higher than 30 dollars per hour. Wage distributions are typically right-skewed, so these observations are not problematic outliers but rather a small number of high earners.
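To make these judgments less eyeball-driven, the standard 1.5 × IQR rule behind the boxplot whiskers can also be applied programmatically. A minimal sketch on a small synthetic wage-like vector (hypothetical values, not the CPS data):

```r
# Hypothetical sketch: flag outliers with the 1.5 * IQR rule used by boxplots.
iqr_outliers <- function(x) {
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1
  x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
}

wages <- c(5, 6, 7, 8, 9, 10, 11, 12, 44.5)  # one extreme high earner
flags <- iqr_outliers(wages)
sum(flags)  # only the 44.5 value is flagged
```

Applied to `CPS$wage`, the same function would flag the handful of high earners visible in the boxplot.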
Let’s now observe the correlation between the numerical variables of interest.
numerical_vars_df <- CPS[numerical_vars]
correlation_matrix <- cor(numerical_vars_df)
correlation_matrix_long <- as.data.frame(correlation_matrix) %>%
rownames_to_column(var = "Var1") %>%
gather(key = "Var2", value = "value", -Var1) %>%
filter(Var1 != Var2) %>%
filter(!duplicated(paste(pmin(Var1, Var2), pmax(Var1, Var2))))
print(correlation_matrix_long)
## Var1 Var2 value
## 1 education wage 0.38192207
## 2 experience wage 0.08705953
## 3 age wage 0.17696688
## 4 experience education -0.35267645
## 5 age education -0.15001947
## 6 age experience 0.97796125
The dependent variable of interest, wage, is positively correlated with the other three numerical variables, most strongly with education (~0.38). Surprisingly, years of experience and age are not strongly correlated with wage.
The strongest positive correlation is observable between age and work experience, which is not surprising if we assume that almost all individuals were working throughout their lives.
What is also interesting is that education is negatively correlated with experience, which might indicate that people with higher education levels do not necessarily have more work experience (this makes sense if we assume that individuals who spent longer in education started working later).
Let’s graphically represent the relationships between wage and education, experience, and age.
We chose to apply the Loess smoothing method to this plot because it is non-parametric and does not assume a linear relationship between the variables. Instead, Loess fits a smoothed curve that best describes the relationship, allowing us to explore potential non-linear patterns in the data more closely. This approach is particularly suitable since we are uncertain about the exact shape of the relationship.
Additionally, the dataset's relatively small size means that individual data points can have a noticeable impact. Because Loess is sensitive to local variation, it lets us see how these factors shape the overall trend between wage and education.
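For reference, `geom_smooth(method = "loess")` is a thin wrapper around base R's `loess()`. A minimal sketch of the same smoothing on synthetic data (the variable names and the `span` value are illustrative assumptions, not taken from the analysis above):

```r
# Sketch of what geom_smooth(method = "loess") does under the hood,
# on synthetic data -- not the CPS sample.
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)   # non-linear signal plus noise

fit <- loess(y ~ x, span = 0.3)      # span controls the local smoothing window
smoothed <- predict(fit, newdata = data.frame(x = x))
length(smoothed)                     # one fitted value per point
```

A smaller `span` tracks local wiggles more closely; a larger one produces a flatter, more global trend.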
scatterplot_education <- ggplot(CPS, aes(x = education, y = wage)) +
geom_point(size=1) +
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95) +
labs(title = "Education And Wage Relationship",
x = "Years of Education",
y = "Wage") +
theme_minimal()
ggplotly(scatterplot_education, width = 800, height = 600)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
If we consider the scatterplot of wage and education, we can observe the positive relationship between the two variables that was previously visible in the correlation matrix. Furthermore, the trend is rather consistent and the variability is not too high, which indicates that the relationship might be quite strong. The consistently increasing upward slope might signal that there are no diminishing returns to additional years of education, at least in the sectors and occupations observed in this dataset.
scatterplot_experience <- ggplot(CPS, aes(x = experience, y = wage)) +
geom_point(size=1) +
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95) +
labs(title = "Experience And Wage Relationship",
x = "Years of Experience",
y = "Wage") +
theme_minimal()
ggplotly(scatterplot_experience, width = 800, height = 600)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Compared to education, work experience does not show a clear relationship with wage: the variability is so high that the graph is visually noisy, with no clearly distinguishable trend. This is consistent with the correlation matrix, which showed a weaker correlation between wage and experience than between wage and education.
As for the trend in the fitted line, there is an upward slope from 0 to ~12 years of experience, followed by a flat segment. This might signal that the positive relationship is not constant across the whole career and is stronger, on average, in the first 10-15 years of work experience. There is also a noticeable drop in wage for individuals with more than ~40 years of experience, but the number of observations in this group is relatively low, and the variability is so high that no preliminary conclusions can be drawn.
Since age is highly correlated with experience, we can expect similar results for the relationship between wage and age on average, but the exact trends might differ, so it might be useful to plot it as well:
scatterplot_age <- ggplot(CPS, aes(x = age, y = wage)) +
geom_point(size=1) +
geom_smooth(method="auto", se=TRUE, fullrange=FALSE, level=0.95) +
labs(title = "Age And Wage Relationship",
x = "Age",
y = "Wage") +
theme_minimal()
ggplotly(scatterplot_age, width = 800, height = 600)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As with experience, the relationship between wage and age is not strictly linear. The fitted line shows a slight upward slope from 18 to 35 years of age, followed by stagnation. This might indicate that wage gains are concentrated in the first years of a career, after which wages plateau or even decline slightly. It is also consistent with the trend we observed between wage and years of experience.
After this preliminary correlation and exploratory analysis, we can proceed with hypothesis testing. Education was found to have the strongest positive correlation with wages among the quantitative variables (the correlations of wages with age and with experience were weaker). Qualitative variables, such as occupation and sector, also showed some differences in wage distribution, and their relationship with wages might be interesting to explore further.
The set of hypotheses can be formulated as follows: the null hypothesis is that education has no effect on wages once experience and socio-economic confounders are controlled for; the alternative is that additional years of education increase wages.
Let’s have a look at the relationship between the level of education and wages, controlling for experience and age.
First, since age and experience are highly correlated, we will check for multicollinearity between these two variables using the Variance Inflation Factor (VIF). Let’s create models with and without the collinear independent variables and compare the results:
# install.packages("stargazer")
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
model_numerical_1 <- lm(wage ~ education + experience + age, data = CPS)
model_numerical_2 <- lm(wage ~ education + experience, data = CPS)
model_numerical_3 <- lm(wage ~ education + age, data = CPS)
stargazer(model_numerical_1, model_numerical_2, model_numerical_3,
type = "text",
title = "Wage Vs. Education, Experience & Age",
align = TRUE,
column.labels = c("All numerical", "w/o Age", "w/o Experience"))
##
## Wage Vs. Education, Experience & Age
## ===========================================================================================
## Dependent variable:
## -----------------------------------------------------------------------
## wage
## All numerical w/o Age w/o Experience
## (1) (2) (3)
## -------------------------------------------------------------------------------------------
## education 0.948 0.926*** 0.821***
## (1.155) (0.081) (0.077)
##
## experience 0.128 0.105***
## (1.156) (0.017)
##
## age -0.022 0.105***
## (1.155) (0.017)
##
## Constant -4.770 -4.904*** -5.534***
## (7.043) (1.219) (1.279)
##
## -------------------------------------------------------------------------------------------
## Observations 534 534 534
## R2 0.202 0.202 0.202
## Adjusted R2 0.198 0.199 0.199
## Residual Std. Error 4.604 (df = 530) 4.599 (df = 531) 4.599 (df = 531)
## F Statistic 44.727*** (df = 3; 530) 67.217*** (df = 2; 531) 67.210*** (df = 2; 531)
## ===========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The first model includes all three variables, the second only education and experience (excluding age), and the third only education and age (excluding experience).
The first model demonstrates highly inflated standard errors, most likely due to multicollinearity between experience and age.
Both the second and third models show a statistically significant and fairly strong association between education and wage, while the coefficients for experience and age, each taken in isolation, are nearly identical. This is indeed a sign of multicollinearity.
Let’s see the Variance Inflation Factor to check the power of multicollinearity in the 1st model:
# install.packages("car")
library(car)
## Warning: package 'car' was built under R version 4.4.1
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.1
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif1 <- vif(model_numerical_1)
print(vif1)
## education experience age
## 229.5738 5147.9190 4611.4008
The calculated VIF is extremely high, which indicates that multicollinearity is a significant issue in the model.
Let’s check VIF for the 2nd and 3rd models:
vif2 <- vif(model_numerical_2)
print(vif2)
## education experience
## 1.142049 1.142049
vif3 <- vif(model_numerical_3)
print(vif3)
## education age
## 1.023024 1.023024
Both models have VIF values ~1, which is a sign of no multicollinearity.
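As a sanity check on what `vif()` reports, the statistic can be computed by hand: VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors. A sketch on synthetic, deliberately collinear data (hypothetical variables, not the CPS ones):

```r
# Sketch of the VIF computation underlying vif(): run the auxiliary
# regression of one predictor on the others, then invert 1 - R^2.
set.seed(42)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.6)  # deliberately collinear with x1

r2 <- summary(lm(x1 ~ x2))$r.squared   # auxiliary regression R^2
vif_manual <- 1 / (1 - r2)
vif_manual                              # > 1, reflecting the shared variance
```

With only two predictors, as in models 2 and 3, this reduces to 1 / (1 - r²) for their pairwise correlation, which is why both variables share the same VIF in the output above.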
Let’s observe the correlations between wage and socio-economic characteristics of the individuals.
model_categorical <- lm(wage ~ ethnicity + region + gender + occupation + sector
+ union + married, data = CPS)
stargazer(model_categorical, type = "text", title = "Wage Vs. Categorical variables", align = TRUE)
##
## Wage Vs. Categorical variables
## ================================================
## Dependent variable:
## ---------------------------
## wage
## ------------------------------------------------
## ethnicityother 0.771
## (1.023)
##
## ethnicitycauc 1.623*
## (0.893)
##
## regionsouth -0.824*
## (0.436)
##
## gendermale 2.028***
## (0.434)
##
## occupationsales 0.761
## (0.893)
##
## occupationworker 0.188
## (0.704)
##
## occupationoffice 1.388**
## (0.678)
##
## occupationtechnical 4.752***
## (0.669)
##
## occupationmanagement 5.691***
## (0.794)
##
## sectormanufacturing 0.714
## (1.033)
##
## sectorother -0.564
## (1.003)
##
## unionyes 2.018***
## (0.531)
##
## marriedyes 0.608
## (0.413)
##
## Constant 4.392***
## (1.480)
##
## ------------------------------------------------
## Observations 534
## R2 0.261
## Adjusted R2 0.243
## Residual Std. Error 4.471 (df = 520)
## F Statistic 14.162*** (df = 13; 520)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Since this model omits the treatment variable (education) and is not the full model of the relationship we are interested in, we can't draw any conclusions about the effect of these variables on wages. However, it is valuable to make preliminary observations on how these common-cause confounders correlate with wages.
As we can see:
Being a male is indeed quite likely to be associated with higher wages
Being of Caucasian ethnic origin can be associated with higher wages
Living in the South can be associated with lower wages
Having an occupation in the technical and managerial positions can be associated with higher wages
Being a union member can be associated with higher wages
This check is consistent with our prior assumptions about the confounding variables. So far, occupation is the strongest predictor of wage level among the confounders. This might also indicate that education indeed translates into higher wages, because technical and managerial positions usually require a higher level of education. At the same time, there is some association with the social position of the individual: ethnicity, location, gender, and union membership are also associated with wage level.
Let’s run the models with the confounders included. We don’t include age due to its high collinearity with experience.
# num + gender
model_full1 <- lm(wage ~ education + experience + gender, data = CPS)
# num + ethnicity
model_full2 <- lm(wage ~ education + experience + ethnicity, data = CPS)
# num + gender + ethnicity
model_full3 <- lm(wage ~ education + experience + gender + ethnicity, data = CPS)
stargazer(model_full1, model_full2, model_full3, type = "text", title = "Wage Level predictors", align = TRUE)
##
## Wage Level predictors
## ===========================================================================================
## Dependent variable:
## -----------------------------------------------------------------------
## wage
## (1) (2) (3)
## -------------------------------------------------------------------------------------------
## education 0.941*** 0.915*** 0.930***
## (0.079) (0.082) (0.080)
##
## experience 0.113*** 0.105*** 0.113***
## (0.017) (0.017) (0.017)
##
## gendermale 2.338*** 2.356***
## (0.388) (0.388)
##
## ethnicityother -0.441 -0.622
## (1.054) (1.020)
##
## ethnicitycauc 0.409 0.336
## (0.923) (0.893)
##
## Constant -6.505*** -5.041*** -6.576***
## (1.210) (1.403) (1.382)
##
## -------------------------------------------------------------------------------------------
## Observations 534 534 534
## R2 0.253 0.205 0.257
## Adjusted R2 0.249 0.199 0.250
## Residual Std. Error 4.454 (df = 530) 4.599 (df = 529) 4.451 (df = 528)
## F Statistic 59.885*** (df = 3; 530) 34.131*** (df = 4; 529) 36.527*** (df = 5; 528)
## ===========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In these models we have added social confounders that are hypothesized to be important:
Education and experience remain statistically significant predictors of wage level in all models.
Including gender also shows a rather strong, statistically significant relationship: being male is associated with higher wages.
Ethnicity is not found to be significant.
Now, let’s examine the model with a categorical variable for occupation included. We expect that this variable will be a significant predictor of wage level, since it is a common cause confounder for the relationship between education and wages.
model_full4 <- lm(wage ~ education + experience + gender + occupation, data = CPS)
stargazer(model_full1, model_full4, type = "text", title = "Wage Level predictors", align = TRUE)
##
## Wage Level predictors
## ====================================================================
## Dependent variable:
## -----------------------------------------------
## wage
## (1) (2)
## --------------------------------------------------------------------
## education 0.941*** 0.717***
## (0.079) (0.099)
##
## experience 0.113*** 0.103***
## (0.017) (0.017)
##
## gendermale 2.338*** 1.965***
## (0.388) (0.416)
##
## occupationsales -0.199
## (0.862)
##
## occupationworker 1.427**
## (0.612)
##
## occupationoffice 0.577
## (0.665)
##
## occupationtechnical 2.818***
## (0.738)
##
## occupationmanagement 3.840***
## (0.806)
##
## Constant -6.505*** -4.662***
## (1.210) (1.397)
##
## --------------------------------------------------------------------
## Observations 534 534
## R2 0.253 0.302
## Adjusted R2 0.249 0.291
## Residual Std. Error 4.454 (df = 530) 4.327 (df = 525)
## F Statistic 59.885*** (df = 3; 530) 28.351*** (df = 8; 525)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In the last model, the "services" occupation is the reference category, so the coefficients for the other categories are interpreted as the difference in wage level relative to "services". Since "services" had the lowest median and interquartile range, it is a natural reference category as the lowest-earning group. As expected, technical and managerial positions are associated with higher wages, by ~2.8 and ~3.8 dollars per hour respectively, on average. So far, these differences are larger in magnitude than the effects of education, experience, and gender.
The model with occupation included shows a jump in R squared (from 0.253 to 0.302) and a noticeable change in the coefficients for education and gender: the magnitude of both decreased after controlling for occupation.
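If a different baseline were preferred, the reference category of a factor can be changed with `relevel()` before fitting. A sketch on synthetic data, where "services" plays the role of the baseline as in the text (the wage-generating equation is an illustrative assumption):

```r
# Sketch: changing the reference category of a factor with relevel().
# Synthetic data; "services" starts as the baseline, as in the text.
set.seed(7)
occ <- factor(sample(c("services", "technical", "management"), 90, replace = TRUE),
              levels = c("services", "technical", "management"))
wage <- 5 + 3 * (occ == "technical") + 4 * (occ == "management") + rnorm(90)

m_services <- lm(wage ~ occ)            # dummies measured against "services"
occ2 <- relevel(occ, ref = "technical") # make "technical" the new baseline
m_technical <- lm(wage ~ occ2)
names(coef(m_technical))                # dummies now measured against "technical"
```

The fitted values are identical under either coding; only the interpretation of the dummy coefficients changes.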
There is a chance of collinearity between occupation and education, that should be checked:
vif4 <- vif(model_full4)
print(vif4)
## GVIF Df GVIF^(1/(2*Df))
## education 1.910504 1 1.382210
## experience 1.188824 1 1.090332
## gender 1.227183 1 1.107783
## occupation 2.056254 5 1.074751
All GVIF values are low, which indicates that multicollinearity is not a serious issue in this model.
# install.packages("lmtest")
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.1
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.1
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(model_full1)
##
## studentized Breusch-Pagan test
##
## data: model_full1
## BP = 9.0939, df = 3, p-value = 0.02807
bptest(model_full4)
##
## studentized Breusch-Pagan test
##
## data: model_full4
## BP = 22.39, df = 8, p-value = 0.004242
The Breusch-Pagan test indicates heteroscedasticity in both models: the variance of the residuals is not constant. This might be a sign of omitted variables.
plot(model_full4$fitted.values, model_full4$residuals,
main = "Residuals vs Fitted: Model with Gender & Occupation",
xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, col = "red")
Let’s now add new confounders to the model 4.
model_full5 <- lm(wage ~ education + experience + gender + occupation + sector, data = CPS)
stargazer(model_full4 ,model_full5, type = "text", title = "Wage Level predictors", align = TRUE)
##
## Wage Level predictors
## =====================================================================
## Dependent variable:
## ------------------------------------------------
## wage
## (1) (2)
## ---------------------------------------------------------------------
## education 0.717*** 0.710***
## (0.099) (0.099)
##
## experience 0.103*** 0.100***
## (0.017) (0.017)
##
## gendermale 1.965*** 2.025***
## (0.416) (0.419)
##
## occupationsales -0.199 -0.284
## (0.862) (0.862)
##
## occupationworker 1.427** 0.907
## (0.612) (0.679)
##
## occupationoffice 0.577 0.529
## (0.665) (0.665)
##
## occupationtechnical 2.818*** 2.723***
## (0.738) (0.740)
##
## occupationmanagement 3.840*** 3.754***
## (0.806) (0.807)
##
## sectormanufacturing 0.479
## (0.996)
##
## sectorother -0.539
## (0.973)
##
## Constant -4.662*** -4.029**
## (1.397) (1.712)
##
## ---------------------------------------------------------------------
## Observations 534 534
## R2 0.302 0.306
## Adjusted R2 0.291 0.293
## Residual Std. Error 4.327 (df = 525) 4.321 (df = 523)
## F Statistic 28.351*** (df = 8; 525) 23.085*** (df = 10; 523)
## =====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Adding sector as a predictor does not seem to produce any significant changes in the model. The coefficients for other predictors remain quite similar, while sector dummy variables are not significant.
model_full6 <- lm(wage ~ education + experience + gender + occupation + region, data = CPS)
stargazer(model_full4, model_full6, type = "text", title = "Wage Level predictors", align = TRUE)
##
## Wage Level predictors
## ====================================================================
## Dependent variable:
## -----------------------------------------------
## wage
## (1) (2)
## --------------------------------------------------------------------
## education 0.717*** 0.694***
## (0.099) (0.100)
##
## experience 0.103*** 0.101***
## (0.017) (0.016)
##
## gendermale 1.965*** 1.990***
## (0.416) (0.415)
##
## occupationsales -0.199 -0.137
## (0.862) (0.861)
##
## occupationworker 1.427** 1.430**
## (0.612) (0.611)
##
## occupationoffice 0.577 0.638
## (0.665) (0.664)
##
## occupationtechnical 2.818*** 2.825***
## (0.738) (0.737)
##
## occupationmanagement 3.840*** 3.832***
## (0.806) (0.804)
##
## regionsouth -0.791*
## (0.417)
##
## Constant -4.662*** -4.139***
## (1.397) (1.421)
##
## --------------------------------------------------------------------
## Observations 534 534
## R2 0.302 0.306
## Adjusted R2 0.291 0.295
## Residual Std. Error 4.327 (df = 525) 4.316 (df = 524)
## F Statistic 28.351*** (df = 8; 525) 25.726*** (df = 9; 524)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Adding region (which indicates whether the individual lives in the South of the US or elsewhere) as a predictor shows that living in the South is associated with lower wages. It has also slightly altered the magnitudes of the other predictors, but their coefficients remain significant.
model_full7 <- lm(wage ~ education + experience + gender + occupation + region + union, data = CPS)
stargazer(model_full6, model_full7, type = "text", title = "Wage Level predictors", align = TRUE)
##
## Wage Level predictors
## =====================================================================
## Dependent variable:
## ------------------------------------------------
## wage
## (1) (2)
## ---------------------------------------------------------------------
## education 0.694*** 0.672***
## (0.100) (0.099)
##
## experience 0.101*** 0.094***
## (0.016) (0.017)
##
## gendermale 1.990*** 1.845***
## (0.415) (0.415)
##
## occupationsales -0.137 0.173
## (0.861) (0.860)
##
## occupationworker 1.430** 1.349**
## (0.611) (0.607)
##
## occupationoffice 0.638 0.801
## (0.664) (0.661)
##
## occupationtechnical 2.825*** 2.880***
## (0.737) (0.731)
##
## occupationmanagement 3.832*** 4.148***
## (0.804) (0.805)
##
## regionsouth -0.791* -0.689*
## (0.417) (0.415)
##
## unionyes 1.517***
## (0.508)
##
## Constant -4.139*** -4.014***
## (1.421) (1.411)
##
## ---------------------------------------------------------------------
## Observations 534 534
## R2 0.306 0.318
## Adjusted R2 0.295 0.305
## Residual Std. Error 4.316 (df = 524) 4.284 (df = 523)
## F Statistic 25.726*** (df = 9; 524) 24.394*** (df = 10; 523)
## =====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Being a union member appears to have a significant positive relationship with the wage level (as theorized before constructing the model). Adding the union membership predictor has also slightly altered the magnitudes of the other predictors, but their coefficients remain significant.
Interestingly, the coefficients affected most are those for occupation: the coefficient for managerial positions has increased.
model_full8 <- lm(wage ~ education + experience + gender + occupation + region + union + married, data = CPS)
stargazer(model_full7, model_full8, type = "text", title = "Wage Level predictors", align = TRUE)
##
## Wage Level predictors
## ======================================================================
## Dependent variable:
## -------------------------------------------------
## wage
## (1) (2)
## ----------------------------------------------------------------------
## education 0.672*** 0.672***
## (0.099) (0.099)
##
## experience 0.094*** 0.090***
## (0.017) (0.017)
##
## gendermale 1.845*** 1.852***
## (0.415) (0.415)
##
## occupationsales 0.173 0.099
## (0.860) (0.865)
##
## occupationworker 1.349** 1.316**
## (0.607) (0.608)
##
## occupationoffice 0.801 0.780
## (0.661) (0.662)
##
## occupationtechnical 2.880*** 2.822***
## (0.731) (0.735)
##
## occupationmanagement 4.148*** 4.098***
## (0.805) (0.808)
##
## regionsouth -0.689* -0.697*
## (0.415) (0.415)
##
## unionyes 1.517*** 1.487***
## (0.508) (0.510)
##
## marriedyes 0.338
## (0.410)
##
## Constant -4.014*** -4.124***
## (1.411) (1.418)
##
## ----------------------------------------------------------------------
## Observations 534 534
## R2 0.318 0.319
## Adjusted R2 0.305 0.305
## Residual Std. Error 4.284 (df = 523) 4.286 (df = 522)
## F Statistic 24.394*** (df = 10; 523) 22.224*** (df = 11; 522)
## ======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Being married does not appear to be a significant predictor of wage level in this model, which is consistent with our preliminary assumptions.
Let’s now compare the model with statistically significant predictors and confounders and the model with all confounders included:
model_full9 <- lm(wage ~ education + experience + gender + occupation + region + union + ethnicity + married + sector, data = CPS)
stargazer(model_full7, model_full9, type = "text", title = "Wage Level predictors", align = TRUE)
##
## Wage Level predictors
## ======================================================================
## Dependent variable:
## -------------------------------------------------
## wage
## (1) (2)
## ----------------------------------------------------------------------
## education 0.672*** 0.655***
## (0.099) (0.101)
##
## experience 0.094*** 0.087***
## (0.017) (0.017)
##
## gendermale 1.845*** 1.939***
## (0.415) (0.418)
##
## occupationsales 0.173 -0.087
## (0.860) (0.868)
##
## occupationworker 1.349** 0.687
## (0.607) (0.678)
##
## occupationoffice 0.801 0.707
## (0.661) (0.663)
##
## occupationtechnical 2.880*** 2.649***
## (0.731) (0.742)
##
## occupationmanagement 4.148*** 3.977***
## (0.805) (0.810)
##
## regionsouth -0.689* -0.564
## (0.415) (0.419)
##
## unionyes 1.517*** 1.601***
## (0.508) (0.512)
##
## ethnicityother -0.230
## (0.991)
##
## ethnicitycauc 0.608
## (0.869)
##
## marriedyes 0.297
## (0.410)
##
## sectormanufacturing 0.562
## (0.991)
##
## sectorother -0.478
## (0.965)
##
## Constant -4.014*** -3.872**
## (1.411) (1.840)
##
## ----------------------------------------------------------------------
## Observations 534 534
## R2 0.318 0.326
## Adjusted R2 0.305 0.307
## Residual Std. Error 4.284 (df = 523) 4.278 (df = 518)
## F Statistic 24.394*** (df = 10; 523) 16.738*** (df = 15; 518)
## ======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Controlling for ethnicity, employment sector, and marital status does not meaningfully change the model: the coefficients of the other predictors stay close to their previous values, the newly added predictors are not significant, and both the adjusted R-squared and the residual standard error are nearly identical across the two models.
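This visual comparison can be backed by a nested-model F-test; the sketch below uses base R's `anova()` and assumes both models were fit as above:

```r
# F-test for the joint significance of the extra confounders
# (ethnicity, married, sector) added in model_full9.
# H0: all of the additional coefficients are zero.
anova(model_full7, model_full9)
```

A large p-value here would support keeping the more parsimonious model_full7.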
library(car)
print(vif(model_full7))
## GVIF Df GVIF^(1/(2*Df))
## education 1.948499 1 1.395887
## experience 1.220804 1 1.104900
## gender 1.245543 1 1.116039
## occupation 2.177498 5 1.080926
## region 1.036379 1 1.018027
## union 1.108663 1 1.052931
All of the GVIF^(1/(2*Df)) values are below 1.4, which indicates that multicollinearity is not an issue in this model once age has been excluded.
plot(model_full7$fitted.values, residuals(model_full7),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red", lty = 2)
The residuals vs. fitted values plot shows no sign of a non-linear pattern in the residuals, suggesting that the linear specification of the model is adequate with all introduced variables.
However, the scatter widens as the fitted values grow, so there is a chance that the homoscedasticity assumption is not actually met.
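This suspicion can be checked formally with the Breusch-Pagan test; the sketch below assumes the `lmtest` package is installed (it is not used elsewhere in this notebook):

```r
library(lmtest)
# Breusch-Pagan test: H0 = constant error variance (homoscedasticity).
# A small p-value would confirm the heteroscedasticity suspected from the plot.
bptest(model_full7)
```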
When we previously plotted the distribution of wage, it was right-skewed, which might have caused the residuals to be heteroscedastic. Let’s apply the log transformation to the dependent variable and see if it improves the model fit.
CPS$log_wage <- log(CPS$wage)
log_wage_plot <- ggplot(CPS, aes(x = log_wage)) +
geom_density(fill = "skyblue", color = "black") +
labs(title = "Density Plot of Log-Transformed Wage",
x = "Log of Wage",
y = "Density") +
theme_minimal()
ggplotly(log_wage_plot, width = 800, height = 600)
The distribution of log-transformed wage is more symmetric and closer to a normal distribution, which may help resolve the heteroscedasticity problem.
model_full7_log <- lm(log_wage ~ education + experience + gender + occupation + region + union, data = CPS)
stargazer(model_full7, model_full7_log, type = "text", title = "Wage Level predictors", align = TRUE)
##
## Wage Level predictors
## ===========================================================
## Dependent variable:
## ----------------------------
## wage log_wage
## (1) (2)
## -----------------------------------------------------------
## education 0.672*** 0.069***
## (0.099) (0.010)
##
## experience 0.094*** 0.011***
## (0.017) (0.002)
##
## gendermale 1.845*** 0.208***
## (0.415) (0.042)
##
## occupationsales 0.173 0.054
## (0.860) (0.086)
##
## occupationworker 1.349** 0.200***
## (0.607) (0.061)
##
## occupationoffice 0.801 0.187***
## (0.661) (0.066)
##
## occupationtechnical 2.880*** 0.361***
## (0.731) (0.073)
##
## occupationmanagement 4.148*** 0.404***
## (0.805) (0.081)
##
## regionsouth -0.689* -0.106**
## (0.415) (0.042)
##
## unionyes 1.517*** 0.206***
## (0.508) (0.051)
##
## Constant -4.014*** 0.639***
## (1.411) (0.142)
##
## -----------------------------------------------------------
## Observations 534 534
## R2 0.318 0.348
## Adjusted R2 0.305 0.336
## Residual Std. Error (df = 523) 4.284 0.430
## F Statistic (df = 10; 523) 24.394*** 27.973***
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note that the residual standard errors of the two models are not directly comparable, since the first is measured in dollars and the second in log-dollars. A fairer comparison is the R-squared, which rises from 0.318 to 0.348, suggesting that the log specification fits the data somewhat better.
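In the log model each coefficient translates into a multiplicative effect on wage. For example, the education coefficient of 0.069 can be converted into a percentage effect as follows (assuming model_full7_log is fit as above):

```r
# Exact percentage change in wage per additional year of education:
# 100 * (exp(beta) - 1), where beta is the coefficient on education.
100 * (exp(coef(model_full7_log)["education"]) - 1)
```

Since exp(0.069) - 1 is approximately 0.071, each additional year of education is associated with roughly a 7% higher hourly wage, holding the other predictors fixed.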
plot(model_full7_log$fitted.values, residuals(model_full7_log),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red", lty = 2)
The visual check of the transformed model shows that the residuals are
evenly distributed around the zero line, which is a sign of
homoscedasticity.
qqnorm(residuals(model_full7_log))
qqline(residuals(model_full7_log))
The Q-Q plot shows that the residuals are approximately normally distributed, which suggests that the normality assumption of the model is satisfied.
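As a complement to the visual check, normality of the residuals can also be tested formally with the Shapiro-Wilk test from base R:

```r
# Shapiro-Wilk test: H0 = residuals are drawn from a normal distribution.
# A p-value above the chosen significance level would be
# consistent with the impression from the Q-Q plot.
shapiro.test(residuals(model_full7_log))
```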